Journal of Biomedical Informatics
Elsevier BV
Preprints posted in the last 7 days, ranked by how well they match the content profile of the Journal of Biomedical Informatics, based on 45 papers previously published here. The average preprint has a 0.07% match score for this journal, so anything above that is an above-average fit.
Kim, S.; Guo, Y.; Sutari, S.; Chow, E.; Tam, S.; Perret, D.; Pandita, D.; Zheng, K.
Social determinants of health (SDoH) are important for clinical care, but it remains unclear how much AI-captured social context is preserved after clinician editing in ambient documentation workflows. We retrospectively analyzed 75,133 paired ambient AI-drafted and clinician-finalized note sections from ambulatory care at a large academic health system. Using a rule-based NLP pipeline, we extracted 21 SDoH categories and quantified retention, deletion, and addition. SDoH appeared in 25.2% of AI drafts versus 17.2% of final notes. At the mention level, AI captured 29,991 SDoH mentions, of which 45.1% were deleted and 54.9% retained; clinicians added 3,583 new mentions. Insurance and marital status were most often deleted, whereas substance use and physical activity were more often retained. Deletion patterns also varied by specialty, supporting the need for specialty-aware ambient AI systems.
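The mention-level accounting described above (retained, deleted, clinician-added) can be sketched with simple set operations; the categories and note snippets below are invented for illustration, and the real pipeline's mention matching is more involved:

```python
# Toy sketch of the retention/deletion/addition accounting for SDoH
# mentions across paired AI-draft and clinician-final notes. Category
# names and snippets are illustrative assumptions, not the study's data.

def compare_mentions(draft, final):
    """Classify SDoH mentions as retained, deleted, or clinician-added."""
    draft, final = set(draft), set(final)
    return {
        "retained": sorted(draft & final),
        "deleted": sorted(draft - final),
        "added": sorted(final - draft),
    }

draft_mentions = {("insurance", "has Medicaid"), ("substance_use", "quit smoking 2019")}
final_mentions = {("substance_use", "quit smoking 2019"), ("housing", "lives alone")}

result = compare_mentions(draft_mentions, final_mentions)
print(result["deleted"])  # the insurance mention did not survive editing
```

Aggregating these per-note counts over all note pairs yields the retention and deletion rates reported above.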
Yamga, E.; Goudrar, R.; Despres, P.
Introduction: Secondary use of electronic health records (EHRs) often requires transforming raw clinical information into research-grade data. A central step in this process is EHR phenotyping: the identification of patient cohorts defined by specific medical conditions. Although numerous approaches exist, from ICD-based heuristics to supervised learning and large language models (LLMs), the field lacks standardized benchmark datasets, limiting reproducibility and hindering fair comparison across methods. Methods: We developed the MIMIC-IV Phenotype Atlas (MIPA) dataset, an adaptation of MIMIC-IV that provides expert-annotated discharge summaries across 16 phenotypes of varying prevalence and complexity. Two independent clinicians reviewed and labeled the discharge summaries, resolving disagreements by consensus. In parallel, we implemented a processing pipeline that extracts multimodal EHR features and generates training, validation, and testing datasets for supervised phenotyping. To illustrate MIPA's utility, we benchmarked four phenotyping methods on the task: ICD-based classifiers, keyword-driven Term Frequency-Inverse Document Frequency (TF-IDF) classifiers, supervised machine learning (ML) models, and LLMs. Results: The final MIPA corpus consists of 1,388 expert-annotated discharge summaries. Annotation reliability was high (mean document-level kappa = 0.805, mean label-level kappa = 0.771), with 91% of disagreements resolved through consensus review. MIPA provides high-quality phenotype labels paired with structured EHR features and predefined train/validation/test splits for each phenotype. In the benchmarking case study, LLMs achieved the highest F1 scores in 13 of 16 phenotypes, particularly for conditions requiring contextual interpretation of clinical narrative, while supervised ML offered moderate improvements over rule-based baselines.
Conclusion: MIPA is the first publicly available benchmark dataset dedicated to EHR phenotyping, combining expert-curated annotations, broad phenotype coverage, and a reproducible processing pipeline. By enabling standardized comparison across ICD-based heuristics, ML models, and LLMs, MIPA provides a durable reference resource to advance methodological development in automated phenotyping.
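One of the baselines benchmarked above, a keyword-driven TF-IDF classifier, can be sketched in a few lines; the keywords, toy corpus, and decision threshold below are assumptions for illustration, not MIPA's actual configuration:

```python
import math

# Minimal sketch of a keyword-driven TF-IDF phenotype baseline: score a
# discharge summary for one phenotype by summing TF-IDF weights of
# phenotype keywords, then threshold. All data here are toy values.

def tfidf_score(doc_tokens, corpus, keywords):
    n = len(corpus)
    score = 0.0
    for kw in keywords:
        tf = doc_tokens.count(kw) / max(len(doc_tokens), 1)
        df = sum(1 for d in corpus if kw in d)
        idf = math.log((1 + n) / (1 + df)) + 1  # smoothed idf
        score += tf * idf
    return score

corpus = [
    ["chest", "pain", "troponin", "elevated", "myocardial", "infarction"],
    ["fracture", "wrist", "cast", "applied"],
    ["cough", "fever", "pneumonia", "antibiotics"],
]
keywords = ["myocardial", "infarction", "troponin"]
scores = [tfidf_score(doc, corpus, keywords) for doc in corpus]
predicted = [s > 0.2 for s in scores]  # threshold chosen for illustration
print(predicted)
```

Rule-based baselines of this kind are what the LLMs above are compared against, which is why standardized splits matter for fair evaluation.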
Haug, M.; Ilves, N.; Umov, N.; Loorents, H.; Suvalov, H.; Tamm, S.; Oja, M.; Reisberg, S.; Vilo, J.; Kolde, R.
Objective: To address the unresolved bottleneck of selecting cohort-relevant clinical concepts for treatment trajectory analysis in observational health data, we introduce CohortContrast, an OMOP-compatible R package for enrichment-based concept identification, temporal and semantic noise reduction, and concept aggregation, enabling cohort-level characterization and downstream trajectory analysis. Materials and Methods: We developed CohortContrast and applied it to OMOP-mapped observational data from the Estonian nationwide OPTIMA database, which includes all cases of lung, breast, and prostate cancer, focusing here on lung and prostate cancer cohorts. The workflow combines target-control statistical enrichment, temporal/global noise filtering, hierarchical concept aggregation, and correlation-based merging, with optional patient clustering for downstream trajectory exploration. We validated the approach with a clinician-based plausibility assessment of extracted diagnosis-concept pairs and evaluated a large language model (LLM) as an auxiliary filtering step. Results: We analyzed 7,579 lung cancer and 11,547 prostate cancer patients. The workflow reduced concept dimensionality from 5,793 to 296 concepts (94.9%) in lung cancer and from 5,759 to 170 concepts (97.0%) in prostate cancer, and identified three exploratory patient subgroups in both cohorts. In a plausibility assessment of 466 diagnosis-concept pairs, validators rated 31.3% as directly linked and 57.5% as indirectly linked. Discussion: CohortContrast reduces manual concept curation by prioritizing and aggregating cohort-relevant concepts while preserving clinically interpretable treatment patterns in OMOP-based real-world data. Conclusion: CohortContrast enables scalable reduction of broad OMOP concept spaces into clinically interpretable, cohort-specific representations for exploratory trajectory analysis and real-world evidence research.
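The target-versus-control enrichment step can be illustrated with a toy prevalence-ratio filter; the concepts, cohorts, ratio threshold, and continuity correction below are hypothetical choices, not CohortContrast's exact statistics:

```python
# Toy sketch of target-vs-control concept enrichment: keep concepts whose
# prevalence in the target cohort sufficiently exceeds that in controls.
# Concepts, cohorts, and the threshold are invented for illustration.

def enriched_concepts(target, control, min_ratio=2.0):
    """target/control: lists of per-patient concept sets."""
    def prevalence(cohort, concept):
        return sum(concept in p for p in cohort) / len(cohort)
    concepts = set().union(*target)
    keep = {}
    for c in concepts:
        p_t = prevalence(target, c)
        p_c = prevalence(control, c) or 1 / (2 * len(control))  # continuity fix
        if p_t / p_c >= min_ratio:
            keep[c] = round(p_t / p_c, 2)
    return keep

target = [{"CT thorax", "bronchoscopy"}, {"CT thorax"}, {"CT thorax", "flu shot"}]
control = [{"flu shot"}, {"CT thorax"}, {"flu shot"}, {"flu shot"}]
hits = enriched_concepts(target, control)
print(hits)  # background concepts like "flu shot" are filtered out
```

Concepts surviving a filter of this kind are then aggregated and merged before trajectory analysis.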
Koh, H. J. W.; Trin, C.; Ademi, Z.; Zomer, E.; Berkovic, D.; Cataldo Miranda, P.; Gibson, B.; Bell, J. S.; Ilomaki, J.; Liew, D.; Reid, C.; Lybrand, S.; Gasevic, D.; Earnest, A.; Gasevic, D.; Talic, S.
Background: Non-adherence to lipid-lowering therapy (LLT) affects up to half of patients and contributes substantially to preventable cardiovascular morbidity and mortality. Existing measures, such as the proportion of days covered, provide cross-sectional summaries but fail to capture the dynamic patterns of adherence over time. Although group-based trajectory modelling identifies distinct longitudinal adherence patterns, no approach currently predicts trajectory membership prospectively while incorporating patient-reported barriers. We developed BRIDGE, a barrier-informed Bayesian model to predict adherence trajectories and identify their underlying drivers. Methods: BRIDGE incorporates patient-reported barriers as structured prior information within a Bayesian framework for adherence-trajectory prediction. The model was designed not only to estimate which patients are likely to follow different adherence trajectories, but also to generate clinically interpretable probability estimates that help explain why those trajectories may arise and what modifiable factors may be most relevant for intervention. Results: BRIDGE achieved a macro AUROC of 0.809 (95% CI 0.806 to 0.813), comparable to random forest (0.815; 95% CI 0.812 to 0.819) and XGBoost (0.821; 95% CI 0.818 to 0.824), two widely used machine-learning benchmarks for structured clinical prediction. Calibration was superior to random forest (Brier score 0.530 vs 0.545), and performance was stable across six independent training runs (AUROC SD = 0.003). Incorporating barrier-informed priors improved accuracy by 3.5% and calibration by 5.5% compared to flat priors, showing that incorporation of patient-reported barriers added value beyond electronic medical record data alone.
Four clinically distinct adherence trajectories were identified: gradual decline associated with treatment deprioritisation amid polypharmacy (10.4%), early discontinuation linked to asymptomatic risk dismissal (40.5%), rapid decline associated with intolerance (28.8%), and persistent adherence (20.2%). Counterfactual analysis identified trajectory-specific intervention levers. Conclusions: BRIDGE provides accurate and well-calibrated prediction of adherence trajectories while offering clinically actionable insights into their underlying drivers. By integrating patient-reported barriers with routine clinical data, the model supports targeted, mechanism-informed interventions at the point of prescribing to improve adherence to cardioprotective therapies. Funding: MRFF CVD Mission Grant 2017451. Evidence before this study: We searched PubMed and Scopus from database inception to December 2025 using the terms "medication adherence", "trajectory", "prediction model", "Bayesian", "lipid-lowering therapy", and "barriers", with no language restrictions. Group-based trajectory modelling has consistently identified three to five adherence patterns across cardiovascular cohorts; however, these applications have been descriptive rather than predictive. Machine-learning models for adherence prediction achieve moderate discrimination but treat adherence as a binary or continuous outcome, thereby overlooking the clinically meaningful heterogeneity captured by trajectory approaches. One prior study applied a Bayesian dynamic linear model to examine adherence-outcome associations, but it did not predict adherence trajectories or incorporate patient-reported barriers. To our knowledge, no published model integrates patient-reported barriers into trajectory prediction. Added value of this study: BRIDGE is, to our knowledge, the first model to incorporate patient-reported adherence barriers as hierarchical domain-informed priors within a Bayesian framework for trajectory prediction.
Using 108 predictors derived from routine electronic medical records, the model achieves discrimination comparable to state-of-the-art machine-learning approaches while additionally providing uncertainty quantification, barrier-level interpretability, and counterfactual insights to inform intervention strategies. The identified trajectories differed not only in adherence level but also in switching behaviour, drug-class evolution, and medication burden, suggesting distinct underlying mechanisms of non-adherence that may require tailored clinical responses. Implications of all the available evidence: Each adherence trajectory implies a distinct intervention target: asymptomatic risk communication for early discontinuers (40.5% of patients), proactive tolerability management for rapid decliners, medication simplification for patients with gradual decline associated with polypharmacy, and maintenance support for persistent adherers. By integrating routinely collected clinical data with patient-reported barriers, BRIDGE can be deployed within existing primary care EMR infrastructure to generate actionable, trajectory- and patient-specific recommendations at the point of prescribing, helping to bridge the gap between adherence measurement and targeted adherence management.
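How barrier-informed priors can shift trajectory assignment relative to flat priors is easiest to see in a toy Bayes-rule calculation; the likelihoods and the barrier-to-prior mapping below are invented, and BRIDGE itself is a far richer hierarchical model:

```python
# Toy illustration of barrier-informed priors over trajectory membership,
# contrasted with flat priors. Trajectory names, likelihoods, and the
# barrier-to-prior mapping are hypothetical values for illustration only.

TRAJECTORIES = ["persistent", "gradual_decline", "early_discontinuation", "rapid_decline"]

def posterior(likelihood, prior):
    unnorm = [l * p for l, p in zip(likelihood, prior)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Likelihood of an observed refill pattern under each trajectory (assumed).
likelihood = [0.30, 0.25, 0.25, 0.20]
flat_prior = [0.25] * 4
# A reported side-effect barrier shifts prior mass toward rapid decline.
barrier_prior = [0.15, 0.15, 0.20, 0.50]

p_flat = posterior(likelihood, flat_prior)
p_barrier = posterior(likelihood, barrier_prior)
print(max(zip(p_flat, TRAJECTORIES))[1])     # flat prior favours persistence
print(max(zip(p_barrier, TRAJECTORIES))[1])  # barrier prior flips the call
```

The reported 3.5% accuracy gain over flat priors is, in spirit, this mechanism operating at scale.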
Ferguson, D. J.
Background: Clinical pharmacists, trainees, and educators rely on multi-database literature retrieval and structured evidence synthesis to answer drug-information questions. Existing workflows require navigation across PubMed, DailyMed, LactMed, interaction checkers, and specialty guideline repositories with manual de-duplication, appraisal, and synthesis. Commercial platforms that integrate these functions are costly and often unavailable in community, rural, and international training contexts. Objective: This report describes the architecture of AuditMed, a single-file, browser-based clinical evidence audit platform, and reports preliminary stress-test results against a complex multi-morbidity case corpus. AuditMed is intended for research and educational use and is not a substitute for clinical judgment or validated commercial clinical decision-support systems. Methods: AuditMed integrates nineteen free, publicly available clinical and biomedical application programming interfaces into a six-stage Search → Select → Parse → Analyze → Infer → Create pipeline and supports browser-local patient-case ingestion with regex-based HIPAA Safe Harbor de-identification. Preliminary stress-testing was conducted against eleven cases (Cases 30 through 40) from the Complex Clinical Case Compendium Software Validation Suite, each featuring over twenty concurrent active disease states. For each case, the one-click inference pipeline was executed with default settings and the full Clinical Inference Report was captured verbatim. No retrieval-sensitivity, synthesis-fidelity, or time-to-answer endpoints were pre-specified; the exercise was qualitative and oriented toward pipeline behavior under extreme multi-morbidity. Results: The pipeline completed without fatal errors for all eleven cases and produced a structured Clinical Inference Report in each instance. Quantitative-finding detection performed as designed for hematologic parameters and cardiac biomarkers.
Two parser defects were identified and are reproduced in the appendix: an age-as-fever regex-precedence defect affecting seven cases and a diagnosis-versus-medication parsing defect affecting one case. The evidence-linkage rate varied from zero evidence-linked statements in seven cases to eleven in one case, reflecting the dependence of the inference layer on MeSH-indexed literature coverage of the specific case diagnoses. Conclusions: AuditMed is an early-stage, open-source platform whose value at this stage is in providing a free, transparent, auditable workflow for multi-source evidence synthesis with explicit uncertainty flagging. The preliminary results document both robust end-to-end completion under extreme case complexity and specific, reproducible parser defects that will be addressed before formal evaluation. Planned evaluation studies are described.
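The age-as-fever regex-precedence defect class reported above can be reproduced in miniature: a loosely anchored temperature pattern happily matches an age. The patterns here are illustrative guesses, not AuditMed's actual regexes:

```python
import re

# Hedged reconstruction of an age-as-fever regex-precedence defect: an
# unanchored numeric pattern for temperature matches the age first, while
# a context-anchored pattern finds the real value. Patterns are invented.

text = "Age: 39. Vitals: temp 38.6 C, HR 88."

naive_fever = re.compile(r"(\d{2}(?:\.\d)?)")                  # grabs the age
anchored_fever = re.compile(r"temp\s*(\d{2}(?:\.\d)?)", re.I)  # requires context

print(naive_fever.search(text).group(1))     # the patient's age, not a fever
print(anchored_fever.search(text).group(1))  # the actual temperature
```

Requiring contextual anchors (or applying more specific patterns before generic ones) is the usual fix for precedence defects of this kind.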
Matthewman, J.; Denaxas, S.; Langan, S.; Painter, J. L.; Bate, A.
Objectives: Large language models (LLMs) have shown promise in creating clinical codelists for research purposes, a time-consuming task requiring expert domain knowledge. Here, we evaluate the performance and assess failure modes of a retrieval augmented generation (RAG) approach to creating clinical codelists for the large and complex medical terminology used by the Clinical Practice Research Datalink (CPRD). Materials & Methods: We set up a RAG system using a database of word embeddings of the medical terminology, which we created using a general-purpose word embedding model (gemini-embedding). We developed 7 reference codelists presenting different challenges and tagged required and optional codes. We ran 168 evaluations (7 codelists, 2 different database subsets, 4 models, 3 epochs each). Scoring was based on the omission of required codes and the inclusion of irrelevant codes. We used model-grading (i.e., grading by another LLM with the reference codelists provided as context) to evaluate the output codelists (a score of 0% being all incorrect and 100% being all correct). Results: We saw varying accuracy across models and codelists, with Gemini 3 Pro (score 43%) generally performing better than Claude Sonnet 4.6 (36%) and Gemini 3 Flash, and OpenAI GPT 5.2 performing worst (14%). Models performed better with shorter target codelists (e.g., Eosinophilic esophagitis with four codes, and Hidradenitis suppurativa with 14 codes). In contrast, all models consistently failed to produce a complete Wrist fracture codelist (with 214 required codes). We further present evaluation summaries and failure mode evaluations produced by parsing LLM chat logs. Discussion: Besides demonstrating that a single-shot RAG approach is currently not suitable for codelist generation, we demonstrate failure modes including hallucinations, retrieval failures, and generation failures where retrieved codes are not used.
Conclusions: Our findings suggest that while RAG systems using current frontier LLMs may create correct clinical codelists in some cases, they still struggle with large and complex terminologies and codelists with a large number of codes. The failure modes we highlight can inform the design of future workflows that avoid these failures.
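The scoring idea, penalising omitted required codes and included irrelevant codes, can be sketched deterministically; the study itself used LLM model-grading, so the formula and the ICD-10-style codes below are illustrative assumptions:

```python
# Deterministic sketch of codelist scoring: reward coverage of required
# codes, penalise irrelevant inclusions. The formula and codes are toy
# stand-ins for the study's LLM-based grading.

def score_codelist(generated, required, optional):
    generated = set(generated)
    allowed = set(required) | set(optional)
    required_found = len(generated & set(required)) / len(required)
    irrelevant = len(generated - allowed)
    precision_like = 1 - irrelevant / max(len(generated), 1)
    return round(100 * required_found * precision_like, 1)

required = {"K20.0", "K21.0"}
optional = {"R12"}
print(score_codelist({"K20.0", "K21.0", "R12"}, required, optional))  # full marks
print(score_codelist({"K20.0", "J45.9"}, required, optional))         # penalised
```

A multiplicative score like this drops quickly when a model both omits required codes and hallucinates irrelevant ones, mirroring the failure modes described above.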
Van, T. A.
Background: Type 2 diabetes mellitus (T2DM) is a leading global public health challenge. Machine learning (ML) combined with Explainable AI (XAI) is increasingly applied to T2DM risk prediction, but the field lacks a quantitative overview of methodological trends and integration gaps. Methods: We present a structured synthesis and critical analysis of the XAI literature on T2DM risk prediction, combining (i) quantitative bibliometric analysis of a two-database corpus (N = 2,048 documents from Scopus and PubMed/MEDLINE, deduplicated via a transparent three-tier pipeline) and (ii) an in-depth selective review of 15 highly cited papers. Reporting follows PRISMA 2020, adapted for metadata-based synthesis; analyses include keyword frequency, rule-based thematic clustering, and publication trend analysis. Results: The field grew rapidly, from 36 documents (2020) to 866 (2025). SHAP and LIME dominate XAI methods; XGBoost and Random Forest dominate ML models. Critically, KG/GNN terms appeared in only 17 documents (~0.83%) compared with 906 for XAI methods, a 53.3:1 disparity. This gap is consistent across both databases, which share 33.2% of their records, ruling out a single-database artifact. The selective review confirmed that none of the 15 highly cited papers combined all three components (ML, XAI, and KG) in T2DM risk prediction. Conclusions: The XAI for T2DM risk prediction field exhibits a clinical interpretability gap: statistical explanations are rarely linked to structured clinical pathways. We propose a three-layer conceptual framework (Predictive → Explainability → Knowledge) that integrates KG as a supplementary semantic layer, with potential applications in clinical decision support and population-level screening. The framework does not perform true causal inference but structures explanations around established pathophysiological knowledge.
This study contributes a transferable methodology and a quantified research gap to guide future work integrating ML, XAI, and structured medical knowledge.
Yamga, E.; Murphy, S.; Despres, P.
Background: Electronic health record (EHR) phenotyping underpins observational research, cohort discovery, and clinical trial screening. Large language models (LLMs) offer new capabilities for extracting phenotypes from unstructured text, but their performance depends on pipeline design choices, including prompting, text segmentation, and aggregation. No systematic framework has previously examined how these parameters shape accuracy and reproducibility. Methods: We evaluated LLM-based phenotyping pipelines using 1,388 discharge summaries across 16 clinical phenotypes. A full factorial experiment with LLaMA-3B, 8B, and 70B systematically varied three pipeline components: prompting (zero-shot, few-shot, chain-of-thought, extract-then-phenotype), chunking (none, naive, document-based), and aggregation (any-positive, two-vote, majority), yielding 24 configurations per model. To compare intrinsic model capabilities, biomedical domain-adapted, commercial frontier (LLaMA-405B, GPT-4o, Gemini Flash 2.0), and reasoning-optimized models (DeepSeek-R1) were evaluated under a fixed configuration. Performance was assessed using precision, recall, and macro-F1; secondary analyses examined prediction consistency (Shannon entropy), self-confidence calibration, and the development of a taxonomy of recurrent model errors. Results: Factorial ANOVAs showed that chunking and aggregation were the dominant drivers of performance, whereas the prompting strategy contributed minimally. Configuration effects were stable across model sizes, with no significant Model × Parameter interactions. Phenotype difficulty varied substantially (macro-F1 = 0.40-0.90), yet the highest-performing configuration (whole-document inference without aggregation) was consistent across phenotypes, as confirmed by mixed-effects modeling. In cross-model comparisons, DeepSeek-R1 achieved the highest macro-F1 (0.89), while LLaMA-70B matched GPT-4o and LLaMA-405B at substantially lower cost.
Prediction entropy was low overall and driven primarily by phenotype difficulty rather than prompting or temperature. Self-confidence calibration was only moderately informative: high-confidence predictions were more accurate, but larger models exhibited systematic overconfidence. Conclusions: LLM performance in EHR phenotyping is governed primarily by input structure and model capacity, not prompt engineering. Simple, document-level inference yields robust performance across diverse phenotypes, providing practical design guidance for LLM-based cohort identification while underscoring the continued need for human oversight for challenging phenotypes.
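The aggregation rules compared above (any-positive, two-vote, majority) and the Shannon-entropy consistency measure are simple to state in code; the vote data below are toy values:

```python
import math

# Sketch of chunk-level vote aggregation (any-positive, two-vote,
# majority) and Shannon entropy over repeated binary predictions, as
# described above. Votes are invented for illustration.

def aggregate(votes, rule):
    pos = sum(votes)
    if rule == "any_positive":
        return pos >= 1
    if rule == "two_vote":
        return pos >= 2
    if rule == "majority":
        return pos > len(votes) / 2
    raise ValueError(rule)

def shannon_entropy(preds):
    p = sum(preds) / len(preds)
    if p in (0.0, 1.0):
        return 0.0  # perfectly consistent predictions
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

chunk_votes = [1, 0, 1, 0, 0]  # per-chunk phenotype calls for one note
print(aggregate(chunk_votes, "any_positive"))  # lenient rule flags the note
print(aggregate(chunk_votes, "majority"))      # strict rule does not
print(shannon_entropy([1, 1, 1]))              # zero entropy: fully consistent
```

The choice between these rules trades recall against precision, which is consistent with chunking and aggregation dominating the factorial results above.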
Lin, T.; Li, Y.; Huang, Z.; Gui, T. T.; Wang, W.; Guo, Y.
Target trial emulation (TTE) offers a principled way to estimate treatment effects using real-world observational data, but analyses of time-varying treatment strategies remain vulnerable to immortal time bias. The clone-censor-weight (CCW) approach is increasingly used to address this problem, yet key aspects of its causal interpretation and implementation remain unclear. In this work, we emulate a target trial using electronic health records (EHRs) to compare completion of a 3-dose 9-valent human papillomavirus (HPV) vaccination series within 12 months versus remaining partially vaccinated among vaccine initiators. We link CCW to the classic potential outcome framework in causal inference, evaluate the role of different weighting mechanisms, and account for within-subject correlation induced by cloning using cluster-robust variance estimation. Our study provides practical guidance for applying CCW in real-world comparative effectiveness studies to address immortal time bias and supports more rigorous and interpretable treatment effect estimation in TTE.
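The cloning-and-censoring core of the CCW design can be caricatured in a few lines: every initiator enters both arms, and each clone is censored once observed behaviour deviates from its assigned strategy. Field names and the simplified censor rule are assumptions for illustration; the weighting and variance estimation steps are omitted:

```python
# Highly simplified sketch of the clone step in clone-censor-weight: each
# vaccine initiator is cloned into both strategies, and a clone is
# censored when observed behaviour deviates from its assigned strategy.
# Field names and the one-shot censor rule are illustrative assumptions.

def clone_censor(patients, doses_required=3):
    clones = []
    for p in patients:
        completed = p["doses_by_12m"] >= doses_required
        # Clone A: "complete the series within 12 months"
        clones.append({"id": p["id"], "arm": "complete", "censored": not completed})
        # Clone B: "remain partially vaccinated"
        clones.append({"id": p["id"], "arm": "partial", "censored": completed})
    return clones

patients = [{"id": 1, "doses_by_12m": 3}, {"id": 2, "doses_by_12m": 1}]
clones = clone_censor(patients)
uncensored = [(c["id"], c["arm"]) for c in clones if not c["censored"]]
print(uncensored)
```

Because both clones of a patient enter follow-up, no person-time is misclassified before the strategy is determined, which is how CCW removes immortal time bias; the informative censoring it introduces is then corrected by the weights discussed above.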
Gjertsen, M.; Yoon, W.; Afshar, M.; Temte, B.; Leding, B.; Halliday, S.; Bradley, K.; Kim, J.; Mitchell, J.; Sanders, A. K.; Croxford, E. L.; Caskey, J.; Churpek, M. M.; Mayampurath, A.; Gao, Y.; Miller, T.; Kruser, J. M.
Importance: Physicians routinely prognosticate to guide care delivery and shared decision making, particularly when caring for patients with critical illnesses. Yet these physician estimates are prone to inaccuracy and uncertainty. Artificial intelligence, including large language models (LLMs), shows promise in supporting or improving this prognostication. However, the performance of contemporary LLMs in prognosticating for the heterogeneous population of critically ill patients remains poorly understood. Objective: To characterize and compare the performance of LLMs and physicians when predicting 6-month mortality for hospitalized adults who survived critical illness. Design: Embedded mixed methods study with elicitation and comparison of prognostic estimates and reasoning from LLMs and practicing physicians. Setting: The publicly available, deidentified Medical Information Mart for Intensive Care (MIMIC)-IV v2.2 dataset. Participants: We randomly selected 100 hospitalizations of adult survivors of critical illness. Four contemporary LLMs (OpenAI GPT-4o, o3-mini and o4-mini, and DeepSeek-R1) and 7 physicians provided independent prognostic estimates for each case (1,100 total estimates; 400 LLM and 700 physician). Main outcomes and measures: For each case, LLMs and physicians used the hospital discharge summary and demographics to predict 6-month mortality (yes/no) and provide their reasoning (free text). We assessed prognostic performance using accuracy, sensitivity, and specificity, and used inductive, qualitative content analysis to characterize their reasoning. Results: Mean physician accuracy for predicting mortality was 70.1% (95% CI 63.7-76.4%), with sensitivity of 59.7% (95% CI 50.6-68.8%) and specificity of 80.6% (95% CI 71.7-88.2%). The top-performing LLM (OpenAI o4-mini) achieved accuracy of 78.0% (95% CI 70.0-86.0%), with sensitivity of 80.0% (95% CI 67.4-90.2%) and specificity of 76.0% (95% CI 63.3-88.0%).
The difference between mean physician and top-performing LLM accuracy was not statistically significant (p = 0.5). Qualitative analysis revealed similar patterns in LLM- and physician-expressed reasoning, except that physicians regularly and explicitly reported uncertainty while LLMs did not. Conclusions and Relevance: In this study, LLMs and physicians achieved comparable, moderate performance in predicting 6-month mortality after critical illness, with similar patterns in expressed reasoning. Our findings suggest LLMs could be used to support prognostication in clinical practice but also raise safety concerns due to the lack of LLM uncertainty expression.
Van Oyen, C.; Mirza-Haq, N.
MedSafe-Dx (v0) introduces a new safety-focused benchmark for evaluating large language models in clinical diagnostic decision support using a filtered subset of the DDx Plus dataset (N=250). MedSafe-Dx evaluates three dimensions: escalation sensitivity, avoidance of false reassurance, and calibration of uncertainty. Models were tasked with providing a ranked differential (ICD-10), an escalation decision (Urgent vs. Routine), and a confidence flag. Performance was measured via a "Safety Pass Rate," a composite metric penalizing three hard failure modes: missed escalations of life-threatening conditions, overconfident incorrect diagnoses, and unsafe reassurance in ambiguous cases. Evaluation of eleven models revealed a significant disconnect between diagnostic recall and safety. GPT-5.2 achieved the highest Safety Pass Rate (97.6%), while several models exhibited high rates of missed escalations or unsafe reassurance. MedSafe-Dx provides a robust stress test for identifying high-risk failure modes in diagnostic decision support and shows that high diagnostic accuracy does not guarantee clinical safety. While the benchmark is currently limited by synthetic data and proxy labels, it provides a reproducible, auditable framework for testing AI behavior before clinical deployment. Our findings suggest that interventions such as safety-focused prompting and reasoning-token budgets could be essential components for the safe deployment of LLMs in clinical workflows.
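A composite metric of the kind described, where any hard failure fails the case, can be sketched as follows; the failure-mode field names and toy cases are assumptions based on the description above, not MedSafe-Dx's actual schema:

```python
# Sketch of a composite "Safety Pass Rate": a case passes only if none of
# the three hard failure modes fires. Field names and cases are toy
# assumptions based on the benchmark's description.

HARD_FAILURES = ("missed_escalation", "overconfident_wrong_dx", "unsafe_reassurance")

def safety_pass_rate(cases):
    passed = sum(not any(c[f] for f in HARD_FAILURES) for c in cases)
    return round(100 * passed / len(cases), 1)

cases = [
    {"missed_escalation": False, "overconfident_wrong_dx": False, "unsafe_reassurance": False},
    {"missed_escalation": True,  "overconfident_wrong_dx": False, "unsafe_reassurance": False},
    {"missed_escalation": False, "overconfident_wrong_dx": False, "unsafe_reassurance": True},
    {"missed_escalation": False, "overconfident_wrong_dx": False, "unsafe_reassurance": False},
]
print(safety_pass_rate(cases))  # half the toy cases pass
```

An all-or-nothing rule like this is what decouples the metric from diagnostic recall: a model can rank the right diagnosis highly and still fail a case on safety grounds.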
Martin, C. M.; Henderson, I.; Campbell, D.; Stockman, K.
Background: The instability-plasticity framework proposes that multimorbidity trajectories periodically enter instability phases that are vulnerable to escalation but also potentially modifiable through relational intervention. Whether such phases commonly resolve without acute care, or predominantly progress to hospitalisation, has not been quantified at scale. Objective: To quantify instability window outcomes across a longitudinal monitoring cohort; to test whether the characteristics distinguishing admitted from resolved windows reflect within-patient trajectory dynamics or between-patient severity; and to characterise which patient-reported and operator-rated signals reliably precede admission, using both a curated pilot sub-cohort and the full monitoring cohort with an explicit cross-cohort comparison. Methods: Two complementary analyses were conducted on data from the MonashWatch Patient Journey Record (PaJR) relational telehealth system. Instability windows were identified algorithmically (≥2 consecutive calls with Total_Alerts ≥3) across the full longitudinal dataset (16,383 calls, 244 patients, 2.5 years) and classified by linkage to ED and hospital admission data. Window characteristics were compared at window, patient, and paired within-patient levels. Pre-admission signal cascades were analysed in two configurations: a curated pilot sub-cohort (64 patients, 280 calls, ±10-day window, 103 admissions, December 2016-September 2017) and the full monitoring cohort (175 patients, 1,180 pre-admission calls, ±14-day window, December 2016-July 2019). A three-way cross-cohort comparison decomposed differences between the two configurations into pipeline and population effects. Results: 621 instability windows were identified across 157 patients (64% of the monitored cohort). 67.3% resolved without hospital admission or ED attendance, a rate stable across alert thresholds 1-5.
In paired within-patient analysis (n = 70), duration in days (p = 0.002) and multi-domain breadth (p < 0.001) distinguished admitted from resolved windows; alert intensity did not. In the pilot sub-cohort, patient-reported illness prognosis (Q21) was the dominant pre-admission signal (GEE β = +0.058, AUC = 0.647, p-BH = 0.018). This finding did not replicate in the full cohort: Q21 was non-significant (GEE β = -0.008, p = 0.154, AUC = 0.507). Cross-cohort analysis identified selective curation of the pilot sub-cohort as the primary explanation. In the full cohort, six signals escalated significantly before admission after Benjamini-Hochberg correction: total alerts, health impairment (Q26), red alerts, self-rated health (Q3), patient concerns (Q1), and operator concern (Q34). Health impairment achieved the highest individual AUC (0.605) and showed the longest pre-admission lead. No individual signal exceeded an AUC of 0.61. Conclusions: Two thirds of instability phases resolve without hospitalisation, providing direct empirical support for trajectory plasticity as a clinically frequent phenomenon. Within the same patient, persistence, both in duration and in the consistency of high-severity multi-domain flagging across calls, distinguishes trajectories that tip into admission from those that resolve. The Q21 signal reversal between cohorts illustrates how selective curation can produce compelling but non-replicable findings in monitoring research. In the full population, objective alert signals and operator judgement, rather than patient-reported illness prognosis, carry the pre-admission signal.
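The algorithmic window rule stated in Methods (>=2 consecutive calls with Total_Alerts >= 3) can be implemented directly; the alert series below is a toy example:

```python
# Direct implementation of the stated window rule: an instability window
# is a run of >= 2 consecutive calls with Total_Alerts >= 3. Returns
# (start, end) index pairs; the alert series is a toy example.

def instability_windows(total_alerts, min_alerts=3, min_len=2):
    windows, start = [], None
    for i, a in enumerate(total_alerts + [0]):  # sentinel closes a trailing run
        if a >= min_alerts:
            start = i if start is None else start
        elif start is not None:
            if i - start >= min_len:
                windows.append((start, i - 1))
            start = None
    return windows

alerts = [0, 3, 4, 1, 5, 0, 3, 3, 3, 2]
print(instability_windows(alerts))  # single high-alert calls do not qualify
```

Linking windows found this way to admission records is what yields the admitted-versus-resolved classification analysed above.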
Sreekanth, J.; Salgado-Baez, E.; Edel, A.; Gruenewald, E.; Piper, S. K.; Spies, C.; Balzer, F.; Boie, S. D.
Routine ICU data offer valuable insights into daily physiological rhythms. While traditional methods assume these cycles maintain fixed periods and amplitudes, their inherent variability requires dynamic estimation of instantaneous trends. The wavelet transform effectively resolves circadian oscillations, especially for frequently measured vital parameters. We present novel extensions to Continuous Wavelet Transform (CWT) power spectral analysis to better detect and segment subtle temporal patterns. Using this approach, we uncover hidden circadian patterns in cardiovascular vitals such as heart rate (HR) and mean blood pressure (MBP) measured over five days in a retrospective cohort of 855 ICU patients. By quantifying non-stationary rhythms, we identified diurnal and semi-diurnal oscillations varying in period and power according to delirium and deep sedation. Notably, HR exhibits a clear diurnal and semi-diurnal rhythm when delirium is absent. Overall, our framework supports the CWT as a powerful tool for analyzing complex physiological signals, particularly vital signs. Crucially, our findings suggest that cardiovascular rhythm disruption can be associated with ICU-related delirium and deep sedation.
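The core of the analysis, wavelet power concentrated at circadian periods, can be sketched with a hand-rolled complex Morlet transform on a synthetic heart-rate series; the scale-period conversion follows the standard Torrence-Compo convention, but the signal and parameters are illustrative, not the authors' pipeline:

```python
import numpy as np

# Minimal sketch of wavelet power at a chosen period: convolve a signal
# with a complex Morlet wavelet and take the squared magnitude. A
# synthetic heart-rate series with a 24 h rhythm stands in for real ICU
# vitals; parameters are illustrative.

def morlet_power(signal, dt_h, period_h, w0=6.0):
    scale = period_h * (w0 + np.sqrt(2 + w0**2)) / (4 * np.pi)  # Torrence-Compo
    t = np.arange(-4 * scale, 4 * scale + dt_h, dt_h)
    wavelet = np.pi**-0.25 * np.exp(1j * w0 * t / scale) * np.exp(-(t / scale)**2 / 2)
    conv = np.convolve(signal - signal.mean(), np.conj(wavelet)[::-1], mode="same")
    return np.abs(conv) ** 2 * dt_h / scale

dt = 0.5  # hours between measurements
t = np.arange(0, 10 * 24, dt)  # ten days of half-hourly samples
hr = 80 + 8 * np.sin(2 * np.pi * t / 24) + np.random.default_rng(0).normal(0, 1, t.size)

power_24h = morlet_power(hr, dt, period_h=24)
power_6h = morlet_power(hr, dt, period_h=6)
print(power_24h.mean() > power_6h.mean())  # the diurnal band dominates
```

Because the power is computed per time point, rhythms whose period or amplitude drifts over a stay remain visible, which is the advantage over fixed-period (cosinor-style) fits noted above.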
Ytsma, C. R.; Torralbo, A.; Fitzpatrick, N. K.; Pietzner, M.; Louloudis, I.; Nguyen, D.; Ansarey, S.; Denaxas, S.
Objective: The aim of this study was to develop and validate an automated, scalable framework to harmonise fragmented UK primary care prescription records into a research-ready dataset by mapping four diverse medical ontologies to a unified, historically comprehensive reference standard. Materials and Methods: We used raw prescription records for consented participants in the UK Biobank, in which participants are uniquely characterized by multiple data modalities. Primary care data were preprocessed by selecting one drug code if multiple were recorded, cleaning codes to match reference presentations, expanding code granularity based on drug descriptions, and updating outdated codes to a single reference version. Harmonisation entailed mapping British National Formulary (BNF) and Read2 codes to dm+d, the universal NHS standard vocabulary for uniquely identifying and prescribing medicines. Harmonised dm+d records were then homogenised to a single concept granularity, the Virtual Medicinal Product (VMP). We validated our methods by creating medication profiles mapping contemporary drug prescribing patterns in 312 physical and mental health conditions. Results: We preprocessed 57,659,844 records (100%) from 221,868 participants (100%). Of those, 48,950 records were dropped due to lack of a drug code. 7,357,572 records (13%) used multiple ontologies. Most records (76%) were encoded in BNF, and most had their code granularity expanded via the drug description (N=28,034,282; 49%). 41,244,315 records (72%) were harmonised to dm+d, and 99.98% of these were converted to VMP as a homogeneous dataset. Across 312 diseases, we identified 23,352 disease-drug associations with 237 medications (represented as BNF subparagraphs) that survived statistical correction, most of which resembled drug-indication pairs.
Conclusion Our methodology converts highly fragmented raw prescription records with inconsistent data quality into a streamlined, enriched dataset at a single reference, version, and granularity of information. Harmonised prescription records can be readily used by researchers to perform large-scale analyses.
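The core harmonisation step can be sketched as a chain of code lookups (all codes below are hypothetical placeholders; a real pipeline uses the NHS dm+d release files and the full BNF/Read2 mapping tables):

```python
# toy mapping tables -- codes are invented for illustration only
bnf_to_dmd = {"bnf-0212": "dmd-1001"}
read2_to_dmd = {"read-bxd1": "dmd-1001"}
dmd_to_vmp = {"dmd-1001": "vmp-atorvastatin-20mg-tablets"}

def harmonise(record):
    """Map a raw prescription record to a single VMP-level concept.

    Records whose ontology or code cannot be mapped return None
    (i.e. they are dropped, as in the preprocessing described above)."""
    lookup = {"BNF": bnf_to_dmd, "Read2": read2_to_dmd}.get(record["ontology"], {})
    dmd = lookup.get(record["code"])
    if dmd is None:
        return None
    return dmd_to_vmp.get(dmd)   # homogenise to VMP granularity

vmp = harmonise({"code": "read-bxd1", "ontology": "Read2"})
```

Note that both the BNF and Read2 codes converge on the same dm+d concept and hence the same VMP, which is exactly what makes multi-ontology records analysable as one dataset.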
Swee, S.; Adam, I.; Zheng, E. Y.; Ji, E.; Wang, D.; Speier, W.; Hsu, J.; Chang, K.-W.; Shivkumar, K.; Ping, P.
Ambulatory electrocardiograms (ECGs) provide continuous monitoring of the heart's electrical activity. However, many existing machine learning and artificial intelligence models for analyzing ambulatory ECG traces are unimodal and do not incorporate patient clinical context. In this study, we propose a multimodal framework integrating ambulatory ECG-derived representations with clinical text embeddings to predict two cardiac outcomes: sudden cardiac death and pump failure death. Ambulatory ECG traces are preprocessed, segmented, and encoded via a multiple instance learning and temporal convolutional neural network framework. In parallel, patient clinical features are parsed into structured prompts, which are passed through a large language model to generate clinical reasoning; this reasoning is then passed through a biomedical language encoder to produce a text embedding. With the ECG and text embeddings, we systematically evaluate multiple fusion strategies, including concatenation- and gating-based approaches, to integrate these two data modalities. Our results demonstrate that multimodal models consistently outperform unimodal baselines, with adaptive fusion mechanisms providing the greatest improvements in predictive performance. Decision curve analysis highlights the potential clinical utility of the proposed framework for risk stratification. Finally, we visualize model attention across modalities, including ECG attention patterns, segment-level saliency, heart rate variability features, and clinical reasoning, to contextualize patient-specific predictions.
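A gating-based fusion of two embeddings can be sketched in a few lines of numpy (dimensions and weights here are illustrative and untrained, not the paper's architecture):

```python
import numpy as np

def gated_fusion(e_ecg, e_text, Wg, bg):
    """Fuse two equal-length embeddings with a learned sigmoid gate.

    The gate decides, per dimension, how much weight to give the ECG
    embedding versus the clinical-text embedding."""
    z = np.concatenate([e_ecg, e_text])             # joint evidence for the gate
    gate = 1.0 / (1.0 + np.exp(-(Wg @ z + bg)))     # values in (0, 1)
    return gate * e_ecg + (1.0 - gate) * e_text

# usage with random (untrained) parameters
rng = np.random.default_rng(0)
d = 4
e_ecg, e_text = rng.normal(size=d), rng.normal(size=d)
Wg, bg = rng.normal(size=(d, 2 * d)), np.zeros(d)
fused = gated_fusion(e_ecg, e_text, Wg, bg)         # shape (d,)
```

Because the gate is a convex weight, each fused dimension always lies between the two modality values, which makes this kind of adaptive fusion easy to inspect relative to plain concatenation.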
Tripathi, H.; Roy, K.; Rahimi, S.; Neupane, S.; Bozorgzad, S.
Sepsis is a leading cause of in-hospital mortality, yet systematically evaluating temporal adherence to the Surviving Sepsis Campaign (SSC) bundle across large patient populations remains difficult due to semantic variability in electronic health records and the loss of clinical nuance inherent in binary pass/fail compliance judgments. We present an expert-guided neuro-symbolic pipeline that pairs LLM-based semantic normalization with a Sugeno fuzzy inference system encoding eight SSC bundle rules, producing graded per-episode compliance scores whose clinical decision boundaries are set through domain expert consultation. Applied to 2,438 sepsis episodes from MIMIC-IV v3.1, the dual-classifier normalization layer achieves substantial inter-system agreement with high embedding-based confirmation, resolving hundreds of clinically relevant drug strings that purely symbolic systems miss. The graded framework reveals that Hour-1 bundle failures, particularly antibiotic timing, are the dominant driver of low overall compliance, and that higher bundle adherence is associated with notably shorter ICU stays, with antibiotic delays beyond six hours increasing median stays by 61%. These results demonstrate that neuro-symbolic graded assessment can surface actionable compliance patterns that binary evaluation frameworks cannot capture.
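A zero-order Sugeno inference step for one bundle dimension can be sketched as follows (the membership breakpoints and rule consequents are invented for illustration; the paper sets its clinical decision boundaries through domain expert consultation):

```python
def timely_membership(delay_h, full=1.0, zero=6.0):
    """Graded membership in 'antibiotics given on time': 1.0 within the
    first hour, falling linearly to 0.0 at six hours (illustrative)."""
    if delay_h <= full:
        return 1.0
    if delay_h >= zero:
        return 0.0
    return (zero - delay_h) / (zero - full)

def sugeno_score(memberships, consequents):
    """Zero-order Sugeno inference: the firing-strength-weighted average
    of constant rule outputs, yielding a graded compliance score."""
    num = sum(m * z for m, z in zip(memberships, consequents))
    den = sum(memberships)
    return num / den if den else 0.0

# two toy rules: timely antibiotics (consequent 1.0) vs. delayed (0.3)
m_timely = timely_membership(3.5)            # 2.5 h past the 1 h target
score = sugeno_score([m_timely, 1.0 - m_timely], [1.0, 0.3])
```

A binary pass/fail check would score this episode 0, while the graded score preserves the clinical nuance of a partial delay.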
Mahmud, S.; Akter, M. S.; Ahamed, B.; Rahman, A. E.; El Arifeen, S.; Hossain, A. T.
Background Depressive symptoms among reproductive-aged women represent a major public health concern in low- and middle-income countries, yet systematic screening remains limited. In most population survey datasets, the low prevalence of depression results in severe class imbalance, which challenges conventional machine learning models. Therefore, we develop and evaluate a bagging-based ensemble machine learning framework to predict depressive symptoms among reproductive-aged women using highly imbalanced Bangladesh Demographic and Health Survey (BDHS) 2022 data. Methods The sample comprised women aged 15-49 years drawn from BDHS 2022 data. Depressive symptoms were defined using the Patient Health Questionnaire (PHQ-9 score ≥10). Candidate predictors were drawn from sociodemographic, reproductive, nutritional, psychosocial, healthcare access, and environmental domains. Feature selection was performed using Elastic Net (EN), Random Forest (RF), and XGBoost models. Five classifiers (EN, RF, Support Vector Machine (SVM), K-nearest neighbors (KNN), and Gradient Boosting Machine (GBM)) were trained using both oversampling-based approaches and the proposed ensemble framework. Model performance was evaluated on an independent test set using accuracy, sensitivity, specificity, F1-score, and the normalized Matthews correlation coefficient (normMCC). Results Approximately 4.8% of women were identified with depressive symptoms. The proposed bagging ensemble framework consistently achieved more balanced predictive performance than oversampling-based models. Average normMCC improved from 0.540 (oversampling) to 0.557 (ensemble). RF and GBM ensembles demonstrated notable improvements in identifying depressive cases, while the EN ensemble achieved the highest overall performance and sensitivity. Threshold optimization yielded stable normMCC across models, indicating robust trade-offs between sensitivity and specificity.
Conclusions Bagging-based ensemble learning provides a more robust and balanced approach than synthetic oversampling for predicting depressive symptoms in highly imbalanced population survey data. This approach has important implications for improving early identification and population-level mental health surveillance in resource-constrained settings.
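The class-balanced bagging idea can be sketched as follows. Each base learner is trained on all minority samples plus an equal-sized bootstrap of the majority class, and scores are the averaged votes; the nearest-centroid base learner below is a simple stand-in for the EN/RF/SVM/KNN/GBM classifiers used in the study:

```python
import numpy as np

def balanced_bagging_scores(X, y, X_test, n_estimators=25, seed=0):
    """Bagging for class imbalance: every base learner sees a balanced
    resample (all minority + a same-size majority bootstrap), and the
    soft score for a test point is the fraction of positive votes."""
    rng = np.random.default_rng(seed)
    minority, majority = X[y == 1], X[y == 0]
    votes = np.zeros(len(X_test))
    for _ in range(n_estimators):
        idx = rng.choice(len(majority), size=len(minority), replace=True)
        pos, neg = minority, majority[idx]            # balanced training set
        c1, c0 = pos.mean(axis=0), neg.mean(axis=0)   # class centroids
        pred = (np.linalg.norm(X_test - c1, axis=1)
                < np.linalg.norm(X_test - c0, axis=1))
        votes += pred
    return votes / n_estimators

# toy imbalanced data: 200 majority points near (0,0), 10 minority near (2,2)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (200, 2)), rng.normal(2, 0.3, (10, 2))])
y = np.r_[np.zeros(200), np.ones(10)]
scores = balanced_bagging_scores(X, y, np.array([[2.0, 2.0], [0.0, 0.0]]))
```

Unlike synthetic oversampling, no artificial minority points are generated; the ensemble simply lets every base learner see the minority class at full weight.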
Devadiga, A.; Singh, P.; Sankar, J.; Lodha, R.; Sethi, T.
Temporal resolution of physiological monitoring in intensive care varies widely across healthcare systems. Artificial intelligence models assume a uniform and fixed frequency of sampling, thus limiting the generalizability of models, especially to resource-limited settings. Here, we propose a novel resolution-transfer task for physiological time series and ask whether models trained on high-resolution data can generalize to a low data-density setting without the need to retrain them. SafeICU, a novel longitudinal pediatric intensive care dataset spanning ten years from a tertiary care hospital in India, was used to test this hypothesis. Self-supervised transformer models were trained on 144,271 patient-hours of high-resolution physiological signals from 984 pediatric ICU stays to learn representations of heart rate, respiratory rate, oxygen saturation, and arterial blood pressure. Transfer of this model to low-resolution data established robust performance in clinically relevant lower-frequency intervals, consistently outperforming models trained directly at coarser resolutions. Further, these representations generalized across patient populations, maintaining performance when evaluated on adult intensive care cohorts from the MIMIC-III and eICU databases without retraining. In a downstream task of early shock prediction, models achieved strong discrimination in the pediatric cohort (area under the receiver operating characteristic curve (AUROC) 0.87; area under the precision-recall curve (AUPRC) 0.92) and retained stable performance across monitoring intervals from 10 to 60 minutes (AUROC 0.78-0.88). Together, these results demonstrate that physiological representations learned from high-resolution data enable time-scale-robust and transferable AI for intensive care. 
The publicly released SafeICU dataset, comprising longitudinal vital signs, laboratory measurements, treatment records, microbiology results, and admission and discharge information, provides a foundation for developing and deploying generalizable clinical AI in resource-limited settings.
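The resolution-transfer evaluation can be mimicked by aggregating a high-resolution series into coarser bins and testing a fixed model at each resolution (a sketch of the evaluation setup only, not the authors' self-supervised transformer):

```python
import numpy as np

def downsample(x, factor):
    """Aggregate a regularly sampled series into coarser bins by averaging.

    Trailing samples that do not fill a complete bin are discarded."""
    n = (len(x) // factor) * factor
    return x[:n].reshape(-1, factor).mean(axis=1)

# e.g. a 6-hour, 1-minute heart-rate trace evaluated at the 10- and
# 60-minute monitoring intervals mentioned above
hr_1min = 70 + np.sin(np.arange(360) * 2 * np.pi / 360)
hr_10min = downsample(hr_1min, 10)   # 36 samples
hr_60min = downsample(hr_1min, 60)   # 6 samples
```

A model trained on `hr_1min`-style inputs and evaluated on `hr_10min` and `hr_60min` without retraining is exactly the resolution-transfer task described above.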
Kheirbakhsh, R.; Mathur, P.; Lawlor, A.
Multimodal machine learning leverages complementary information from diverse data sources and has shown strong promise in medical imaging, where multimodal data is critical for clinical decision making. In glioma grading, integrating MRI modalities with clinical data can improve diagnostic accuracy, yet systematic comparisons of fusion strategies remain limited. This study evaluates early, intermediate, and late fusion approaches, addressing the question: How does the inclusion of clinical data alongside MRI modalities influence grading performance? To assess modality contributions, we design adaptable fusion layers and employ interpretability techniques, including attention-based analysis. Our results show that incorporating clinical data consistently outperforms unimodal and MRI-only baselines, with intermediate fusion yielding the most reliable gains. Beyond accuracy, the framework reveals how MRI and clinical features jointly shape predictions, underscoring the importance of both fusion design and interpretability for clinical adoption.
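The three fusion strategies compared above differ only in where the modalities meet; a minimal sketch (weights and dimensions illustrative, not the study's networks):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def early_fusion(x_mri, x_clin, w):
    """Early fusion: concatenate raw inputs, then apply a single model."""
    return sigmoid(w @ np.concatenate([x_mri, x_clin]))

def intermediate_fusion(h_mri, h_clin, w):
    """Intermediate fusion: concatenate per-modality latent features
    inside the network, then apply a shared prediction head."""
    return sigmoid(w @ np.concatenate([h_mri, h_clin]))

def late_fusion(p_mri, p_clin, alpha=0.5):
    """Late fusion: combine the per-modality model predictions."""
    return alpha * p_mri + (1.0 - alpha) * p_clin
```

Intermediate fusion is often the sweet spot because each encoder can be modality-specific while the head still sees a joint representation, which matches the gains reported above.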
Hakata, Y.; Oikawa, M.; Fujisawa, S.
Who is affected: In Japan, approximately 100 million chest radiographs (CXRs) are acquired annually, while only about 7,000 board-certified diagnostic radiologists practice nationwide (Japan Radiological Society workforce statistics; OECD Health Statistics, most recent available year). This implies an average workload exceeding 10,000 imaging studies per radiologist per year if all CXRs were attributed to board-certified diagnostic radiologists (an upper-bound estimate, because in practice many CXRs are primarily read by non-radiologist physicians). In settings such as night shifts, weekends, remote islands, and regional care networks, non-radiologist physicians frequently act as primary readers. Despite strong demand for AI assistance, existing systems are typically limited by one of three shortcomings -- poor cross-institutional generalization, limited interpretability, or inability to generate draft reports -- and consequently see limited clinical deployment. What we built: We propose a Box-Latent Trinity that embeds each image as a hyperrectangle parameterized by a center c and a radius r, rather than as a single point in a latent space. We further introduce BL-TTA (Box-Latent Test-Time Augmentation), which approximately closes the train-inference gap (exact in the N → ∞ limit; N = 8 suffices in practice) by averaging predictions over samples drawn from within the latent box at inference time. Both components are implemented on top of the frozen MedSAM2 medical imaging foundation model. A single box representation simultaneously supports three functions: (A) theoretically grounded source selection, (B) device-invariant augmentation, and (C) case-based retrieval-augmented generation (RAG). Each prediction is accompanied by retrieved similar prior cases, a calibrated confidence estimate, and clinical-guideline references.
How well it performs: On the Open-i CXR corpus (2,954 image-report pairs) under a patient-level 80/10/10 split and 5-seed reproducibility, the full system B5 achieves macro area under the receiver-operating-characteristic curve (macro-AUROC) 0.639 (best-seed test; 5-seed mean 0.626, Table 2; absolute +0.015 over the strongest same-backbone baseline, Merlin-style 0.624), elementwise accuracy 0.753 (absolute +0.072 over Merlin-style 0.681 -- equivalent to approximately 7 fewer label-level errors per 100 (label, image) predictions across 14 finding labels, not per 100 images), and report label-F1 0.435 (absolute +0.086, relative +25% over the strongest same-backbone report-generation baseline, Bootstrapping-style 0.349). Under simulated pixel-space device-shift intensities up to twice the training distribution, AUROC degrades by only 0.014. Brier score (macro) is 0.061; Cohen's κ between two independent rule-based label extractors is 0.702 (substantial agreement); the box radius yields an out-of-distribution (OOD) detection AUROC of 0.595; and the framework provides four structural explainable-AI (XAI) outputs -- retrieved similar cases, confidence tier, per-axis uncertainty, and visual saliency -- which we jointly quantify in a single CXR study, a combination that, to our knowledge, has not been reported previously. Path to deployment: Because the complete experiment can be reproduced in under two hours on a consumer-grade GPU (NVIDIA RTX 4060, 8 GB VRAM), the framework can run on compute resources already available at typical healthcare institutions.
The approach thus supports the practical delivery of evidence-grounded diagnostic support to night shifts, remote-island care, and secondary readings in health checkups -- settings in which a board-certified radiologist is not locally available. One-sentence summary: Reproducible end-to-end in under two hours on a single consumer-grade GPU, the proposed framework outperforms the strongest same-backbone medical-AI baselines on three principal metrics, maintains accuracy under simulated device shifts, and automatically drafts evidence-grounded radiology reports, offering a compute-efficient direction toward reducing the reading burden of Japanese radiologists, subject to external validation.
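The BL-TTA averaging step, as described, amounts to sampling from the latent box and averaging head outputs; a minimal sketch (the prediction head, dimensions, and values below are illustrative):

```python
import numpy as np

def bl_tta(c, r, head, n=8, seed=0):
    """Box-Latent Test-Time Augmentation sketch: average a prediction
    head over n latent samples drawn uniformly from the hyperrectangle
    [c - r, c + r] defined by center c and per-axis radius r."""
    rng = np.random.default_rng(seed)
    z = c + rng.uniform(-1.0, 1.0, size=(n, c.size)) * r
    return np.mean([head(zi) for zi in z], axis=0)

c = np.array([0.2, -0.1, 0.5])     # box center (illustrative embedding)
r = np.array([0.05, 0.05, 0.05])   # per-axis radius, i.e. latent uncertainty
score = bl_tta(c, r, head=lambda z: float(z.sum()), n=8)
```

With a zero radius the box collapses to its center and BL-TTA reduces to an ordinary point-embedding prediction, which is the train-inference gap the averaging is meant to close.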